

1  Introduction 

Synchronization has been an area of research in computer science since the 1960s. Although
synchronization mechanisms for uniprocessors are well understood, the same cannot be said for
multiprocessors. Two factors increase the relative cost of synchronization on multiprocessors:
network communication costs and resource contention. Since synchronization occurs quite frequently
on multiprocessors, it is important to develop high-throughput, low-contention synchronization
mechanisms. In this paper we describe our research on reader-writer synchronization for
shared-memory multiprocessors.

Recent work has focused on developing good algorithms for mutual exclusion on shared-memory
multiprocessors [2, 12, 14]. Lamport’s work [12] concentrates on minimizing the number of remote
memory accesses; unfortunately, his algorithm leads to spinning over the network in the absence of
caches. Anderson’s lock [2] and Mellor-Crummey and Scott’s lock [14] are similar: both use
fetch-and-add (or similar operations) to eliminate remote spinning, and as a result they generate
very little network traffic.

Reader-writer locks relax the constraints of mutual exclusion: a given reader-writer lock can be held
by multiple readers, but can only be held by one writer at a time; in addition, a given lock cannot be
held simultaneously by both readers and writers. Such a lock can be used to ensure the sequential
consistency of shared memory, as long as all readers and writers obey the locking protocol.
Reader-writer locks are extremely useful for multiprocessors, since shared memory is a common
abstraction on multiprocessors. Our interest in reader-writer locks lies in the fact that there is an
opportunity for parallelism among readers that has not been exploited in previous work.

Reader-writer locks were first described by Courtois et al. [8] in 1971. Recent papers by
Mellor-Crummey and Scott [14, 15] have presented several new algorithms for reader-writer
synchronization in shared-memory multiprocessors. Their algorithms are based on the use of atomic
operations such as “fetch-and-add” and “compare-and-swap”, and are extensions of their work on
algorithms for mutual exclusion synchronization.

Mellor-Crummey and Scott call their algorithms for reader-writer synchronization “scalable”;
unfortunately, they define “scalable” to mean that the throughput of lock acquisition remains
constant as the number of processors that request the lock increases. This definition of scalability,
although satisfactory for mutual exclusion algorithms, is unsatisfactory for reader-writer
synchronization. In particular, if all (or a large number of) processes are readers, the throughput of
lock acquisition should increase linearly with the number of processors, since the semantics of
read-locks do not preclude multiple readers from acquiring locks in parallel.

We have developed two new algorithms for reader-writer synchronization that take advantage of
this opportunity for parallelism. Our algorithms provide scalable throughput for readers; they differ
in the tradeoffs that they make between the costs for readers and for writers. Using Proteus [4, 5, 9],
a multiprocessor simulator, we compared their performance with two standard algorithms. When a
large proportion of lock requests are reads, one of our algorithms scales substantially better than
recent algorithms. Indeed, our results show that previously described algorithms [15] do not scale
at all.

In  Section  2  we  describe  our  two  new  algorithms:  static  and  dynamic  reader-writer  locks.  In  Section  3 
we  describe  the  experiments  that  we  performed  to  compare  our  algorithms  with  two  other  algorithms. 
Section  4  evaluates  the  algorithms.  Section  5  discusses  related  and  future  work,  and  we  draw  conclusions 
in  Section  6. 

2  Our  Algorithms 

We present two new spin-waiting algorithms for reader-writer synchronization on shared-memory
multiprocessors: the static and the dynamic algorithms. These algorithms were designed to handle one
process per processor; however, they could be easily modified to handle multiple processes per
processor. We view network delay and contention as the primary issues in achieving high-performance
synchronization; the additional overhead to support multiple processes per processor should not
affect the relative performance of our algorithms. Our model assumes that shared memory is
partitioned among the processors; each processor has “local” shared memory that can be accessed
without going over the network.

All of our algorithms use spin-waiting, but readers in the static and dynamic algorithms spin only on
local memory. We use “semaphore” to mean a word of memory that is accessed using test-and-set with
exponential backoff; we do not use test-and-test-and-set, as we do not assume the presence of
caches.^1 The literature describes several different fairness conditions that can be implemented
[3, 8, 16]: first-come-first-serve, reader-priority, and fair-readers priority are but a few. We have
not implemented any of these conditions, as we consider fairness to be a secondary concern. The
addition of fairness constraints would involve additional synchronization, which would probably lead
to a reduction in performance.
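The "semaphore" defined above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `BackoffSemaphore`, the delay parameters, and the use of `threading.Lock` as a stand-in for an atomic test-and-set word are all assumptions made for the example. The cap on the delay mirrors the paper's footnote about limiting the backoff.

```python
import random
import threading
import time


class BackoffSemaphore:
    """One word of memory acquired by test-and-set with capped exponential backoff."""

    def __init__(self, base_delay=1e-6, max_delay=1e-3):
        # threading.Lock stands in for the atomic test-and-set word (assumption).
        self._word = threading.Lock()
        self.base_delay = base_delay
        self.max_delay = max_delay  # hypothetical cap on the backoff

    def test_and_set(self):
        """Return True if the word was clear and is now set by the caller."""
        return self._word.acquire(blocking=False)

    def acquire(self):
        delay = self.base_delay
        while not self.test_and_set():
            # Wait a random time, then double the bound up to the cap.
            time.sleep(random.uniform(0, delay))
            delay = min(2 * delay, self.max_delay)

    def release(self):
        self._word.release()

    def is_set(self):
        return self._word.locked()
```

Every failed attempt is a fresh test-and-set, with no intermediate read of the word; that matches the paper's choice to avoid test-and-test-and-set when caches are not assumed.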

2.1  Static  algorithm 

Our static algorithm allocates one semaphore per processor, plus an extra semaphore that acts as a
gate on writers. In order to acquire a static lock, a reader acquires a local semaphore, whereas a
writer must acquire all of the semaphores. The gate is present to keep the writers from interfering
excessively with readers. In an earlier version that did not have the gate, the performance of the
first reader was poor, because all of the writers were trying to acquire its semaphore. Figure 1
outlines the structure of a static lock: an arrow from a process points to a semaphore that the
process must acquire in order to acquire the reader-writer lock. When releasing a static lock, a
reader simply releases its local semaphore; a writer releases all of the semaphores. The lock can be
viewed as a quorum of semaphores, where the read quorum is 1 and the write quorum is n. We call this
algorithm static since the quorum is static.

Figure 1: Static reader-writer lock (diagram labels: writer gate, processes, local semaphores;
R = reader, W = writer)

^1 We place a limit on the backoff; without a limit, we quickly overflowed 32 bits in most of our
experiments.

The  static  lock  has  two  important  properties:  readers  do  not  interfere  with  each  other,  and  readers 
do  not  have  to  go  over  the  network  to  acquire  a  lock.  This  algorithm  performs  extremely  well  when  all 
lock  requests  are  reads:  a  reader  only  needs  to  acquire  a  local  semaphore.  Unfortunately,  the  fact  that 
readers  never  interfere  means  that  writers  must  do  a  substantial  amount  of  work  in  systems  with  many 
processors;  when  even  a  few  percent  of  the  requests  are  writes,  the  throughput  suffers  dramatically,  since 
a  writer  must  acquire  a  semaphore  on  every  processor  in  order  to  acquire  a  static  lock. 
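The structure of the static lock can be sketched as below. This is an illustrative sketch under stated assumptions: the class name `StaticRWLock` and its method names are hypothetical, blocking `threading.Lock` objects stand in for the spin semaphores, and the order in which a writer takes the gate and the local semaphores (gate first, then processors in index order) is an assumption, since the text only says that a writer acquires all of them.

```python
import threading


class StaticRWLock:
    """Read quorum of 1 (the local semaphore), write quorum of n (all of them)."""

    def __init__(self, n_processors):
        # One semaphore per processor; in the paper each lives in that
        # processor's local shared memory.
        self.local = [threading.Lock() for _ in range(n_processors)]
        # Extra semaphore that keeps writers from competing with each other
        # for the readers' semaphores.
        self.gate = threading.Lock()

    def read_acquire(self, pid):
        self.local[pid].acquire()      # local operation only

    def read_release(self, pid):
        self.local[pid].release()

    def write_acquire(self):
        self.gate.acquire()            # assumed order: gate first
        for sem in self.local:         # then every processor's semaphore
            sem.acquire()

    def write_release(self):
        for sem in self.local:
            sem.release()
        self.gate.release()
```

The loop in `write_acquire` makes the cost visible: a writer does work proportional to the number of processors, while a reader touches exactly one word.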

2.2  Dynamic  algorithm 

Our dynamic algorithm provides each processor with a semaphore and a bit that indicates whether the
semaphore is “valid.” Figure 2 outlines the structure of a dynamic lock (note that the reader queue
and valid list are pointers to shared memory). A dynamic lock can be viewed as a variant of the
static lock that keeps track of the active readers in a centralized location; this saves work for
writers, as they do not have to contact every processor on every lock acquisition and release. We
call this algorithm dynamic since the write quorum is dynamically maintained.

Figure 2: Dynamic reader-writer lock

The list of valid semaphores is protected by a Mellor-Crummey and Scott (MCS) mutual exclusion
lock [14].^2 If a reader tries to acquire a dynamic lock and its semaphore is valid, it can just
acquire the semaphore^3; otherwise, it must acquire the MCS mutex. After the reader has acquired the
mutex, it checks the writer bit. If there is an active writer (i.e., the writer bit is set), it adds
itself to the reader queue, releases the mutex, and waits; otherwise, it adds its semaphore to the
list, marks its semaphore valid and acquires it, and releases the mutex. Waiting is implemented using
semaphores: the waiting process enqueues the address of its local semaphore, sets the semaphore,
releases the mutex, and then spins until the semaphore is cleared; the process that performs the
wakeup just clears the semaphore. In order for a reader to release a dynamic lock, it just needs to
release its local semaphore.
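The waiting protocol described above (enqueue the local semaphore, set it, release the mutex, spin until it is cleared) can be sketched with ordinary threads. The names `SpinWord`, `enqueue_and_wait`, and `wake` are hypothetical, and `threading.Lock` stands in for the MCS mutex. In the paper the waiter already holds the mutex when it decides to wait; here the function takes the mutex itself so that the example is self-contained.

```python
import threading
import time


class SpinWord:
    """A local semaphore word that a waiting process spins on."""

    def __init__(self):
        self.is_set = False


def enqueue_and_wait(word, queue, mutex):
    """Publish the word on the queue, set it, drop the mutex, then spin."""
    with mutex:
        queue.append(word)   # enqueue the "address" of the local semaphore
        word.is_set = True   # set it before the mutex is released
    while word.is_set:       # spin on local memory only
        time.sleep(0)        # yield so other Python threads can run


def wake(word):
    """The waker just clears the semaphore."""
    word.is_set = False
```

Because the word is set while the mutex is still held, a waker that later dequeues it under the mutex cannot clear it before the waiter has started waiting.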

^2 The MCS mutual exclusion lock should not be confused with their reader-writer lock, although they
are very similar.

^3 However, the reader must check the validity of its semaphore after acquiring the semaphore; if it
has become invalid, the reader must acquire the mutex.

In order for a writer to acquire a dynamic lock, it must first acquire the mutex protecting the list.
If there is an active writer (i.e., the writer bit is set), it puts itself on the writer queue,
releases the mutex, and waits. Otherwise, it sets the writer bit and releases the mutex. The writer
then invalidates all of the semaphores on the valid list, clears the valid list, and then waits for
each semaphore to be cleared, which ensures that there are no readers with locks. The writer then has
the lock, and can continue. A writer unlocks a dynamic lock by first reacquiring the mutex, and then
checking if there is a waiting writer. If a writer is waiting, the unlocking writer wakes it up, and
then releases the mutex. If there is no waiting writer, it checks if there are waiting readers. If
there are, it swaps the pointers for the valid list and the reader queue, makes all of the waiting
readers’ semaphores valid, clears the writer bit, and releases the mutex; if there are no waiting
readers, it just clears the writer bit and releases the mutex.

Although  readers  can  interfere  with  each  other  in  this  algorithm,  they  do  not  always  do  so;  two 
readers  can  only  interfere  with  each  other  when  they  are  both  trying  to  acquire  the  MCS  mutex  to  make 
their  semaphores  valid.  The  reason  that  we  make  them  interfere  is  to  reduce  the  amount  of  work  that 
a  writer  must  do;  this  decreases  the  throughput  of  readers,  but  increases  the  throughput  of  writers  and 
the  overall  throughput.  In  addition,  although  a  reader  might  have  to  go  over  the  network  to  acquire  a 
lock,  it  does  not  have  to  go  over  the  network  if  its  local  semaphore  is  valid. 
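The bookkeeping of the dynamic lock can be traced with a small single-threaded model. This is a sketch of the state transitions only, not a concurrent implementation: each method body stands for a step that the real lock performs while holding the MCS mutex (or, for the fast path, on local memory), and no spinning takes place. The class and method names are hypothetical. Two details are assumptions where the text is not explicit: a woken reader has its semaphore cleared and then retries through the fast path, and a woken writer inherits the lock with the writer bit still set.

```python
class DynamicRWLockModel:
    """Single-threaded model of the dynamic lock's bookkeeping (no real spinning)."""

    def __init__(self, n_processors):
        self.sem = [False] * n_processors    # per-processor semaphore, True = set
        self.valid = [False] * n_processors  # per-processor valid bit
        self.valid_list = []                 # processors whose semaphore is valid
        self.reader_queue = []               # readers waiting for a writer to finish
        self.writer_queue = []               # writers waiting for a writer to finish
        self.writer_bit = False

    def reader_acquire(self, pid):
        """Return True if the read lock is held, False if the reader must wait."""
        if self.valid[pid]:
            self.sem[pid] = True             # fast path: local semaphore only
            return True
        # Slow path: the real lock runs this while holding the MCS mutex.
        if self.writer_bit:
            self.reader_queue.append(pid)
            self.sem[pid] = True             # the word the reader spins on
            return False
        self.valid_list.append(pid)
        self.valid[pid] = True
        self.sem[pid] = True
        return True

    def reader_release(self, pid):
        self.sem[pid] = False

    def writer_acquire(self, pid):
        """Return the readers to wait for, or None if queued behind a writer."""
        if self.writer_bit:
            self.writer_queue.append(pid)
            return None
        self.writer_bit = True
        invalidated, self.valid_list = self.valid_list, []
        for p in invalidated:
            self.valid[p] = False
        return invalidated

    def readers_drained(self, invalidated):
        """True once every invalidated semaphore has been cleared."""
        return not any(self.sem[p] for p in invalidated)

    def writer_release(self):
        """Return the pids woken by this release."""
        if self.writer_queue:
            return [self.writer_queue.pop(0)]  # hand over; writer bit stays set
        woken = list(self.reader_queue)
        # Swap the valid list and the reader queue, as in the text.
        self.valid_list, self.reader_queue = self.reader_queue, self.valid_list
        for p in woken:
            self.valid[p] = True
            self.sem[p] = False              # wakeup clears the spun-on word
        self.writer_bit = False
        return woken
```

In this model a writer only visits the processors on the valid list, which is the saving over the static lock; readers pay for it by entering the slow path whenever their semaphore is not valid.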


3  Experiments 

We compared the performance of various algorithms for reader-writer synchronization using the Proteus
simulator, a high-performance, reconfigurable multiprocessor simulator that was developed at MIT
[4, 5, 9]. Proteus is an execution-driven simulator that interleaves the execution of an application
program with the simulation of the underlying architecture. This structure makes it possible for
Proteus to achieve a high degree of accuracy, which has been confirmed by several experiments [7, 9].
We configured Proteus to run our experiments on k-ary 1-cubes, 2-cubes, and 3-cubes of various sizes:
1, 2, 4, 8, 27, 64, and 125 processors. Although the Proteus simulator does allow us to simulate
shared-memory caches, we did not do so, as we did not want the choice of caching scheme to affect our
results. In addition, we wanted to make the movement of data as explicit as possible.

We measured throughput under various different percentages of read requests: 0%, 50%, 75%, 95%,
99%, and 100%. In each experiment, every process executes a tight loop; at each iteration, each
process chooses to act as a reader with probability p percent.^4 When any process completes 400
iterations, the experiment stops. When p = 99, however, the experiment runs until 1600 iterations
complete; the larger number of iterations is necessary in order to make sure that each process
completes a reasonable number of writes. When the first process finishes, we stop measuring
throughput so that the drop in contention (and resulting increase in throughput) as processes finish
is factored out of our results.

^4 Given a chance of p percent that an operation is a read, it is not true that at any time p percent
of the processes are readers: p must be weighted by the time a reader takes to acquire and release a
lock relative to the time a writer takes.
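The weighting of p by operation time can be made concrete with a short calculation. The function below is an illustrative sketch: the name `reader_time_fraction` and the timing values in the usage note are hypothetical, and it assumes each operation is independently a read with probability `p_read` and that a read and a write take fixed average times.

```python
def reader_time_fraction(p_read, t_read, t_write):
    """Long-run fraction of a process's lock time that is spent as a reader.

    p_read  -- probability that an operation is a read (0.0 to 1.0)
    t_read  -- average time for a reader to acquire and release the lock
    t_write -- average time for a writer to acquire and release the lock
    """
    read_time = p_read * t_read
    write_time = (1.0 - p_read) * t_write
    return read_time / (read_time + write_time)
```

With made-up costs of 10 cycles per read and 1000 cycles per write, `reader_time_fraction(0.99, 10, 1000)` is a little under one half, so even at 99% reads a process would spend about half of its time acting as a writer.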

The last parameter that we varied is the time spent holding the lock. In one set of experiments,
each process spends 0 cycles holding the lock. This measures how the algorithms scale at the highest
contention levels, and also demonstrates how the overhead for lock acquisition and release scales. In
the other set of experiments, each process spends 50 cycles holding the lock; this is intended to
model a simple system where each process spends a relatively small amount of time holding a lock. We
did not vary the time spent outside of the lock, since the effect of that would be equivalent to
having fewer processes accessing the lock.


3.1  Existing  Algorithms 

We compared our two new algorithms with two “standard” algorithms: a monitor-based algorithm and
an MCS algorithm. The first algorithm is a simple extension to a monitor-based reader-writer lock [3];
it is intended to model a very simple implementation of a reader-writer lock. The MCS algorithm is the
fair version of the MCS reader-writer lock [15].


3.2  Monitor-based  algorithm 

Figure 3 outlines the structure of a reader-writer lock based on a monitor. A reader that requests a
monitor-based lock must first acquire the monitor, which is implemented as a semaphore. It then checks
the writer bit; if it is 0, the reader increments the reader count and releases the monitor;
otherwise, it enqueues itself on the reader queue, releases the monitor, and waits. In order to
release a monitor-based lock, a reader must acquire the monitor, decrement the reader count, and
release the monitor. If the process finds that it is the last reader (i.e., the number of readers is 0
after the decrement), it wakes up a writer and sets the writer bit before releasing the monitor.

A  writer  that  requests  a  monitor-based  lock  must  first  acquire  the  monitor.  If  there  is  already  a 
writer,  or  if  there  are  any  readers,  the  writer  enqueues  itself  on  the  writer  queue.  Otherwise,  it  sets  the 
writer  bit  and  releases  the  monitor.  In  order  to  release  a  monitor-based  lock,  a  writer  must  acquire  the 
monitor,  clear  the  writer  bit,  and  then  release  the  monitor.  If  there  are  waiting  readers,  the  writer  wakes 
them  all,  and  sets  the  reader  count  appropriately;  if  there  are  no  waiting  readers  and  there  is  a  waiting 
writer,  it  wakes  that  writer,  and  sets  the  writer  bit. 
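The monitor-based lock described above can be sketched with standard threading primitives. This is an illustrative sketch, not the code used in the experiments: the class name `MonitorRWLock` is hypothetical, a blocking `threading.Lock` stands in for the monitor's spin semaphore, and each waiter blocks on its own `threading.Event` instead of spinning. Waking a writer only when one is actually queued is an assumption the text leaves implicit.

```python
import threading
from collections import deque


class MonitorRWLock:
    def __init__(self):
        self.monitor = threading.Lock()   # the paper uses a spin semaphore here
        self.writer_bit = False
        self.reader_count = 0
        self.reader_queue = deque()       # one Event per waiting reader
        self.writer_queue = deque()       # one Event per waiting writer

    def _wake_writer(self):
        # Caller holds the monitor: hand the lock to the next waiting writer.
        if self.writer_queue:
            self.writer_bit = True
            self.writer_queue.popleft().set()

    def read_acquire(self):
        with self.monitor:
            if not self.writer_bit:
                self.reader_count += 1
                return
            wakeup = threading.Event()
            self.reader_queue.append(wakeup)
        wakeup.wait()   # the waker has already counted this reader

    def read_release(self):
        with self.monitor:
            self.reader_count -= 1
            if self.reader_count == 0:
                self._wake_writer()

    def write_acquire(self):
        with self.monitor:
            if not self.writer_bit and self.reader_count == 0:
                self.writer_bit = True
                return
            wakeup = threading.Event()
            self.writer_queue.append(wakeup)
        wakeup.wait()   # the waker has already set the writer bit

    def write_release(self):
        with self.monitor:
            self.writer_bit = False
            if self.reader_queue:
                # Wake every waiting reader and count them all at once.
                self.reader_count = len(self.reader_queue)
                while self.reader_queue:
                    self.reader_queue.popleft().set()
            else:
                self._wake_writer()
```

Every acquire and every release passes through the single monitor, which is the serialization point that the static and dynamic locks are designed to avoid for readers.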
Figure 3: Monitor lock


3.3  Fair  Queue 

The MCS algorithm is one version of the MCS reader-writer lock [15]; it can be viewed as an extension
of a monitor lock where the “monitor” is controlled by using the compare-and-swap primitive. The
MCS algorithm is fair because it is FIFO; processes are allowed to run in the order in which they are
enqueued. Although Mellor-Crummey and Scott avoid creating significant network traffic with their
algorithm, they impose two requirements on processes: there is still serialization to acquire a lock,
and every process must go over the network to acquire a lock.^5 Our algorithms attempt to avoid
imposing these two requirements, and in doing so achieve better performance.

^5 These requirements are present because the MCS lock is an extension of the MCS mutual exclusion
lock, and so retains most of its characteristics.

4  Performance  Evaluation

Figures 4-15, which were produced by the Proteus simulator, illustrate the results of our experiments.
Throughput is measured in iterations completed per 1000 machine cycles: an iteration consists of a
lock acquisition followed by a release. Even though we performed our experiments on a completely
different platform than Mellor-Crummey and Scott’s, it is important to note that the behavior of
their algorithm that we observed is very similar to what is presented in their paper [15].
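The throughput unit used in the figures is a simple normalization. A minimal sketch, with a hypothetical function name and made-up example numbers:

```python
def throughput_per_1000_cycles(iterations, cycles):
    """Iterations (one lock acquisition plus release each) per 1000 machine cycles."""
    return 1000.0 * iterations / cycles
```

For instance, 400 iterations completed in 20,000 simulated cycles would be reported as a throughput of 20.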

4.1  Comparing  Lock  Overhead 

Figures 4-9 illustrate how the various algorithms scale when processes spend 0 cycles inside the lock;
this demonstrates how the overhead of lock acquisition and release affects throughput. Figure 4
demonstrates that both of our algorithms scale linearly with 100% readers, whereas the MCS and the
monitor-based locks do not allow any real parallelism among readers in their access to the lock; we
cut the graph off so that the data for the monitor-based and MCS algorithms could be seen.

Figure  5  demonstrates  that  the  static  algorithm  does  not  scale  well,  even  with  a  small  percentage 
of  writes,  because  a  writer  incurs  overhead  proportional  to  the  number  of  processors.  With  a  small 
number  of  processors,  a  writer  does  not  have  to  acquire  many  semaphores;  however,  as  the  number  of 
processors  increases,  the  throughput  of  acquiring  a  write-lock  decreases  dramatically,  and  becomes  a 
limiting  factor.  The  dynamic  algorithm  continues  to  perform  quite  well:  with  64  or  more  processors,  it 
achieves  a  throughput  that  is  an  order  of  magnitude  higher  than  that  achieved  by  the  MCS  algorithm. 

As Figure 6 illustrates, the static algorithm does not scale when we decrease the ratio of reads to
writes. The throughput of writers begins to dominate the curve; the shape of the curve begins to look
like 1/n, where n is the number of processors. This makes sense, as the latency for a writer increases
linearly with n. The throughput curve for our dynamic algorithm also begins to flatten; we do not get
an increase in throughput as the number of processors increases. As the number of processors
increases, it is more likely that any process is performing or waiting to perform a write, since the
writer latency increases; the throughput of writes becomes the limiting factor for total throughput.
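The 1/n shape follows directly from a writer latency that grows linearly with n. The toy model below is an assumption-laden sketch, not a fit to the measured data: it supposes that writes are fully serialized and that acquiring each of the n semaphores costs a fixed, hypothetical number of cycles.

```python
def writer_limited_throughput(n_processors, cycles_per_semaphore):
    """Writes per 1000 cycles when each write costs n * cycles_per_semaphore cycles."""
    write_latency = n_processors * cycles_per_semaphore
    return 1000.0 / write_latency
```

Under this model, doubling the number of processors halves the write throughput, which is the 1/n behavior described in the text.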

We can see from Figure 7 that the MCS algorithm begins to outperform the dynamic algorithm as
the number of writes becomes substantial. However, the dynamic lock is within a constant factor of
the MCS lock, and these factors are larger than they should be, as we did not work hard to tune our
implementation of the dynamic lock. For example, when a process reads the valid list, it performs a
remote read for every element in the list. A more efficient implementation would read the list in
larger blocks, which would result in lower latency and less network contention. For this reason, we
believe that even at a lower percentage of reads, the performance of the dynamic lock could be
comparable to that of the MCS lock.

Figures 8 and 9 demonstrate that as the number of writers increases, the MCS algorithm performs
even better relative to the others. This is to be expected, since our algorithms are designed to take
advantage of a large number of readers. Interestingly, the dynamic algorithm performs at least as well
as the monitor-based algorithm, even with 0% readers.

[Plots omitted. Each plot shows throughput (iterations/1000 cycles) against the number of processors
for the monitor-based, MCS, static, and dynamic algorithms.]

Figure 4: Scalability of Reader-Writer Lock Algorithms at 100% reads, 0% writes, 0 cycles in lock
Figure 5: Scalability of Reader-Writer Lock Algorithms at 99% reads, 1% writes, 0 cycles in lock
Figure 8: Scalability of Reader-Writer Lock Algorithms at 50% reads, 50% writes, 0 cycles in lock
Figure 9: Scalability of Reader-Writer Lock Algorithms at 0% reads, 100% writes, 0 cycles in lock
Figure 10: Scalability of Reader-Writer Lock Algorithms at 100% reads, 0% writes, 50 cycles in lock
Figure 11: Scalability of Reader-Writer Lock Algorithms at 99% reads, 1% writes, 50 cycles in lock
Figure 12: Scalability of Reader-Writer Lock Algorithms at 95% reads, 5% writes, 50 cycles in lock
Figure 13: Scalability of Reader-Writer Lock Algorithms at 75% reads, 25% writes, 50 cycles in lock
Figure 14: Scalability of Reader-Writer Lock Algorithms at 50% reads, 50% writes, 50 cycles in lock
Figure 15: Scalability of Reader-Writer Lock Algorithms at 0% reads, 100% writes, 50 cycles in lock

4.2  Performance  Under  A  Synthetic  Load 

Figures 10-15 illustrate how the various algorithms scale when processes spend 50 cycles inside the
lock. This models a more realistic situation, where processes do spend some time holding a lock. The
shapes of the throughput curves in these figures are very similar to those in Figures 4-9, and we can
draw similar conclusions.


5  Related  Work 

This work is very similar to work in distributed filesystems [6, 17], although the thrust is
different. Our goal is to increase throughput in multiprocessors, whereas the main motivation in
distributed systems has typically been to avoid the latency of remote lock requests. Burrows’ MFS
filesystem [6] uses a locking scheme that resembles our dynamic lock, where a central server keeps
track of all of the processes with read locks. The Vaxcluster system [17] provides a distributed lock
manager that allows a more general locking semantics than reader-writer synchronization; its structure
is also similar, in that each resource has an associated lock manager that keeps track of the
outstanding locks for that resource.

Distributed systems also deal with dropped messages, network partitions, and processor failures;
however, most multiprocessor systems are not designed to deal with such occurrences. The use of
time-based expiration of locks has been explored by Gray and Cheriton as a means of dealing with
fault-tolerance [10]: a process that requests a lock gets a lease, which expires after a certain time.
It would be interesting to combine such a mechanism with the dynamic algorithm, as we expect that
fault-tolerance will become more important as multiprocessors scale.

There is also some similarity between this work and research on cache coherence. The dynamic lock
can be viewed as directly implementing a directory-based cache consistency scheme for a reader-writer
lock [1]: the valid list is analogous to a directory of possible readers. Similarly, the static lock
can be viewed as generalizing a bus-based (broadcast) cache consistency scheme.

Recent work has explored two-phase (spin, then block) algorithms for waiting in multiprocessors
[11, 13]. This work has shown that two-phase algorithms perform better than pure spinning or blocking,
and that some two-phase strategies perform within a constant factor of the offline optimal decision to
spin or block. In generalizing our algorithms to handle multiple processes per processor, it would be
very useful to incorporate this work.
6  Conclusion

We have experimented with two new algorithms for reader-writer synchronization, in the hope that the
opportunity for parallelism between readers would allow us to achieve higher throughput than existing
algorithms. Although the static algorithm does not exhibit better performance (except when there are
very few writes), the dynamic algorithm achieves substantially higher throughput and better
scalability than current algorithms.

In  environments  where  reads  are  a  high  percentage  of  lock  requests,  the  dynamic  algorithm  performs 
significantly  better  than  the  MCS  lock.  When  writes  are  a  significant  proportion  of  lock  requests,  the 
MCS  lock  performs  slightly  better.  This  is  to  be  expected,  for  as  the  percentage  of  writes  increases,  a 
reader-writer  lock  increasingly  behaves  like  a  mutex.  In  highly  parallel  systems,  it  is  likely  that  writes  to 
heavily  shared  memory  will  be  infrequent;  if  there  are  many  such  writes  in  an  application,  the  parallelism 
available  is  likely  to  be  very  low. 

It would be useful to find analytic bounds on the performance of reader-writer locks; we could
then evaluate in a more absolute sense how the dynamic lock performs. We also designed another
algorithm that uses combining-tree techniques; we intended this algorithm to be more adaptive than the
dynamic algorithm. Unfortunately, this tree-based algorithm always performs worse than the dynamic
algorithm. We are investigating the possibility of improving this algorithm; in particular, the use of
blocking instead of spinning may help.

An important implication of our research is that the design of an algorithm should take into account
the semantics of the problem. The MCS lock does not take advantage of the differences in semantics
between reader-writer synchronization and mutual exclusion; as a result, the throughput that it
achieves for readers does not scale. Our results suggest that significant performance advantages can
be obtained by paying careful attention to the semantics of synchronization primitives.